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Abstract 

Neuroimage analysis usually involves learning thou¬ 
sands or even millions of variables using only a lim¬ 
ited number of samples. In this regard, sparse models, 
e.g. the lasso, are applied to select the optimal features 
and achieve high diagnosis accuracy. The lasso, how¬ 
ever, usually results in independent unstable features. 
Stability, a manifest of reproducibility of statistical re¬ 
sults subject to reasonable perturbations to data and the 
model (Yu 2013) , is an important focus in statistics, 
especially in the analysis of high dimensional data. In 
this paper, we explore a nonnegative generalized fused 
lasso model for stable feature selection in the diagno¬ 
sis of Alzheimer’s disease. In addition to sparsity, our 
model incorporates two important pathological priors: 
the spatial cohesion of lesion voxels and the positive 
correlation between the features and the disease labels. 
To optimize the model, we propose an efficient algo¬ 
rithm by proving a novel link between total variation 
and fast network flow algorithms via conic duality. Ex¬ 
periments show that the proposed nonnegative model 
performs much better in exploring the intrinsic struc¬ 
ture of data via selecting stable features compared with 
other state-of-the-arts. 


Introduction 


Neuroimage analysis is challenging due to its high fea¬ 
ture dimensionality and data scarcity. Sparse models such 
as the lasso ( [Tibshirani 1996) have gained great reputa¬ 
tion in statistics and machine learning, and they have been 
applied to the analysis of such high dimensional data by 
exploiting the sparsity property in the absence of abun¬ 
dant data. As a major result, automatic selection of rele¬ 
vant variables/features by such sparse formulation achieves 
promising performance. For example, in ( |Liu, Zh ang, and 


She n 2012] ), the lasso model was applied to the diagno¬ 


sis of Alzheimer’s disease (AD) and showed better perfor¬ 
mance than the support vector machine (SVM), which is one 
of the state-of-the-arts in brain image classification. How¬ 
ever, in statistics, it is known that the lasso does not always 
provide interpretable results because of its instability ( fYu| 
|2013| ). “Stability” here means the reproducibility of statis¬ 
tical results subject to reasonable perturbations to data and 
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the model. (These perturbations include the often used Jack- 
nife, bootstrap and cross-validation.) This unstable behavior 
of the lasso model is critical in high dimensional data anal¬ 
ysis. The resulting irreproducibility of the feature selection 
are especially undesirable in neuroimage analysis/diagnosis. 
However, unlike the problems such as registration and clas¬ 
sification, the stability issue of feature selection is much less 
studied in this field. 


In this paper we propose a model to induce more sta¬ 
ble feature selection from high dimensional brain structural 
Magnetic Resonance Imaging (sMRI) images. Besides spar¬ 
sity, the proposed model harnesses two important additional 
pathological priors in brain sMRI: (i) the spatial cohesion of 
lesion voxels (via inducing fusion terms) and (ii) the pos¬ 
itive correlation between the features and the disease la¬ 
bels. The correlation prior is based on the observation that 
in many brain image analysis problems (such as AD, fron¬ 
totemporal dementia, corticobasal degeneration, etc), there 
exist strong correlations between the features and the labels. 
For example, gray matter of AD is degenerated/atrophied. 
Therefore, the gray matter values (indicating the volume) 
are positively correlated with the cognitive scores or dis¬ 
ease labels {-1,1}. That is, the less gray matter, the lower 
the cognitive score. Accordingly, we propose nonnegative 
constraints on the variables to enforce the prior and name the 
model as “non-negative Generalized Fused Lasso” (n 2 GFL). 
It extends the popular generalized fused lasso and enables it 
to explore the intrinsic structure of data via selecting sta¬ 
ble features. To measure feature stability, we introduce the 
“Estimation Stability” recently proposed in ( Yu 201 3| and 
the (multi-set) Dice coefficient ( [Dice 1945) . Experiments 
demonstrate that compared with existing models, our model 
selects much more stable (and pathological-prior consistent) 
voxels. It is worth mentioning that the non-negativeness per 
se is a very important prior of many practical problems, 


e.g. (Lee and Seung 1999). Although n z GFL is proposed 
to solve the diagnosis of AD in this work, the model can be 
applied to more general problems. 


Incorporating these priors makes the problem novel w.r.t 
the lasso or generalized fused lasso from an optimization 
standpoint. Although off-the-shelf convex solvers such as 
CVX ( [Grant and Boyd 2013| can be applied to solve the 
optimization, it hardly scales to high-dimensional problems 
in feasible time. In this regard, we propose an efficient algo- 























rithm that solves the n 2 GFL problem exactly. We general- 
ize the proximal g radient methods (such as FISTA) (Beck 


[and Teboulle 2009) to solve our constrained optimization 
and prove its convergence. We then show that by using an 
element-wise post-processing, the resulting proximal oper¬ 
ator can be reduced to the total variation (TV) problem. It 
is known that TV can be solved by parametric flow algo¬ 
rithms ( [Chambolle and Darbon 200j?||Xin et al. 2014| ). In the 
present study, we provide a novel equivalence via conic du¬ 
ality, which gives us a minimum quadratic cost flow formu¬ 
lation plochbaum and Hong 1995|). Fast flow algorithms (in¬ 
cluding parametric flow) are then easily applied. In practice, 
our algorithm runs hundreds of times faster than CVX at the 
same precision and can scale to high-dimensional problems. 

Related work. In addition to sparsity, people leverage 
underlying data structures and introduce stronger jriors 
such as the structured sparsity ( [Jacob, Qbozinski, and Vert 


2009| to increase model stability. However, for voxel-based 


sMRI data analysis, handcrafted grouping of the voxels or 
sub-structures may not coincide with various pathological 
topology priors. Consequently, group lasso (with overlap) 
([Jacob, Qbozinski, and Vert 2009[ Jenatton et al. 2012] 
[Rao etal. 2013 ) is not an ideal model to the problem. In con¬ 
trast, the graph-based structured sparse models adapt better 
to such a situation. The most popular one is referred here 
as LapL0 which adopts 1 2 normregularizationof neighbor¬ 
hood variable difference (e.g. ( [Ng and Abugharbieh 2011 [ 
[Grosenick et al. 2013| ). However, as we will show in the ex¬ 
periments, these models select many more features than nec¬ 
essary. Very recently, generalized fused lasso or total vari¬ 
ation has been successful applied to brain image analyses 
problems inducing the l\ difference (Gramfort, Thirion, and 


Varoqua ux 2013| Xin et al. 2014| ). In the experiments, we 


show that by including an extra nonnegative constraint, the 
features selected by our model is much more stable than that 


of such unconstrained models. A very recent work (Avants 


et al. 2014) also explored this positive correlation (partially 


supporting our assumption), but the problem formulation 
was quite different: neither structural assumption was con¬ 
sidered, nor the stability of feature selection was discussed. 
From the optimization standpoint, the applied framework is 
similar to that of ( |Xin et al. 2014| ) but two key differences 
exist: (1) the FIS TA and soft-thresholding process applied in 
( Xin et al. 2014[ ) do not generalize to constrained optimiza¬ 
tion problems, we show important modifications and provide 
theoretical proof; (2) we propose a novel understanding of 
TV’s relation with flow problems via conic duality and prove 
that the minimum norm point problem of ( [Xin et al. 2014| ) 
is a special case of our framework. 


The Proposed Method 

Nonnegative Generalized Fused Lasso (n 2 GFL) 

Let {(x^,^)}^ 1 be a set of samples, where E and 
yi E M are features and labels, respectively. Also, we denote 

Although different names are given in e.g. |Ng and Abughar- 
bieh 2011; Grosenic k et al. 201 3| >, they are in fact fundamentally 
applying the graph Laplacian smoothing. 


by X E R dxN and y E the concatenations of x* and yi. 
Then, we consider the formulation 

d 

mm l(f3;X,y) + 1/3*1 + A 2 ^ - Pj\, 

/3£R i=1 (iJ)eB 

s.t. (3 > 0 

( 1 ) 

where Ai,A 2 > 0 are tuning parameters. I is a loss term 
of variable (3 (assumed to be convex and smooth). WijS are 
pre-defined weights. Here, the variables (e.g. the sMRI voxel 
parameters in the AD problem) are supposed to have cer¬ 
tain underlying structure represented by a graph G = (V, E) 
with nodes V and edges E. Each variable corresponds to a 
node on the graph. As mentioned above, in many brain im¬ 
age analysis, there exist strong directional correlations (pos¬ 
itive or negative) between the features and the labels, thus 
we assume (3 > 0 (or f3 < 0). Due to the l± penalties on 
each variable as well as each adjacent pair of variables in 0, 
solutions tend to be both sparse and smooth, i.e., adjacent 
variables tend to be similar and spatially coherent. Also be¬ 
cause we have added the nonnegative constraints, the model 
will not select negatively correlated features as support. In 
practice, we notice that unconstrained models will system¬ 
atically select many negatively correlated features. The non¬ 
negative constraints greatly reduce these falsely recovered 
variables and encourage genuine disease-related features to 
be selected. 


Efficient Optimization of n 2 GFL 

The optimization of n 2 GFL is convex and off-the-shelf 
solver such as CVX can be applied. However, this solution 
hardly scales to a problem sized of thousands (mainly due 
to its choice of general second order frameworks), see Table 
[T] In this regard, we propose certain modifications to scal¬ 
able first orderjnethods (e.g. accelerated proximal methods 
( [Beck and Teboulle 2009| )). This is done by exploring La¬ 
grange multiplier method to deal with the constraints. From 
the optimization standpoint, these modifications are non¬ 
trivial and compose one major contribution of this work. 

We first extend the (fast) iterative shrinkage thresholding 
algorithm (ISTA and FISTA) ( [Beck and Teboulle 2009| ) as 
follows. 

Proposition 1. Let (3* be the optimal solution to 0 and 13 k 
defined as follows 
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IA-/H s.t. f3 > 0 , 


where L > 0 is the Lipschitz constant ofVl(-) and k is the 
iteration number. 

Ifz k =(3 k -lVl(l3 k ),thenF(l3 k )-F(l3*) < aL|l/3 °~ r "f 
where Ff) is the objective of ([!]). Ifzfi = y k — LyZ( y k ) 
where y k = (3 k + a k (/3 k — /3 k ~ 1 ) with a controlling the 
momentum, we have F((3 k ) — F((3*) < 




























































The proof can be viewed as an instantiation of the con¬ 
vex analysis introduced in (Nesterov and Nesterov 2004). 
We provide a rigorous proof in the supplementary file. 

Now the key to solve O is how efficiently we solve G- 
If there were no constraint s, (|2| ) is the fused lasso signal ap¬ 
proximation proposed in ( [Friedman et al. 20071 , where it 
was shown that by utilizing the separability of the li norm, 
an element-wise soft-threshold technique can be applied to 
remove the sparse term. Since the constraints of G are also 
separable, we show how G can be further reduced likewise. 
Proposition 2. If we define 


(3 = min 

/3eR d 


WP-Al+f © w ij\Pi-Pj\’ 


(3) 




then the optimal solution to G ( denoted as (3*) can be 
achieved by an element-wise post-processing to (3 as follows 

13 * = max(sign(/3) © max(|/3| - 0), 0); (4) 

±j 

where © is an element-wise product operator. 

Proof We define = 77 and 0^ = X2 ™ ZJ respectively 
for alii G V and (i, j) G E. We denote (3 r as the optimal 
solution of the unconstrained problem of G- According to 
(Friedman et al. 2007 ), f3[ = sign(^) max(|/^| — 0 i: 0). We 
now consider the nonnegative constraints in ([2]). According 
to the Karush-Kuhn-Tucker (KKT) conditions, the necessary 
and sufficient conditions for , ...f3^ are 


Lg% — {fii — Zi) + OiSi + 


Oij tij 


© Oijtji - oh = 0 , s.t. otifii = 0 , 


(5) 


j:(j,i)e£ 

where <© > 0 are the Lagrange multipliers and s, t are sub¬ 
gradients: Si = sign(/%) if fa 0 and © G [—1,1] if (3{ = 
0; Uj = sign(/% - f3j ) for fii J3j and Uj G [-1,1] if 


fii = f3j. The objective equation in §5^ is to set the derivative 
of the Lagrange function to zero and the constraint equations 
are obtained from the complementary slackness condition. 
We now consider two cases of j3\ 

Case 1 (3[ > 0: Note that by setting <© = 0, > 0 satisfies 

the conditions in ([5]), thus /?* = (3[ = sign(^) max(|/^| — 
9i, 0 ) is the solution of ([ 2 ]). 

Case 2 f3[ < 0: We can set f}* = 0 and <© = — f3[ > 0, then 
we have f3[ = /?* — a© 

Lgi 

X 

j:(j,i)e£ 

= (/3* — oii—zf) + OiSi + y© 

j:(i,j)e£ 

= (Pi — Z i) + @i s i + y© 

j:(i,j)e£ 

Hence, in summary, we have 
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Oijtji — 0 . 


f3* = max{sign((3 ) © max(\f3\ — 0), 0). 

Ij 
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Notice that, G is a (continous) total variation prob¬ 
lem, which is known can be efficiently solved by para¬ 
metric flow algorithms in fChambolle and Darbon 2009[ 
Xin et al. 2014| ). Here we present a more general perspective 
of such an equivalence via conic duality, which gives us a 
natural and novel minimum quadratic cost flow formulation. 
Fast flow algorithms , suc h as but not limited to parametric 
flow ( [Gallo, Grigoriadis, and Tarja 1989| ) etc. are then eas¬ 
ily applied. For example, we show that the minimum norm 
point problem solved by parametric flow in ( [Xin et al. 2014| ) 
can be viewed as a special case of the proposed dual. 

Conic Dual to Total Variation 

To solve G we apply generalized inequalities and its cor¬ 
responding Lagrange duality introduced in (Boy d and Van- 
denberghe 2004| ). Specifically, we first define a set C such 
that C={(/3, a) e R d+1 | V(i,j), | Pi - Pj\ <a}. 

Lemma 3. The set C is a proper cone. 

This can be easily shown by checking all the properties 
required by a proper cone. See the supplementary file for 
a proof. We now consider the following problem (we keep 


using e i:j = 


for ©m 


in 


G): 


min -1|/3 — z 

v(©)e£,{/3*©R d ,c©£R} 2 


s.t. ((3 lJ ,a lJ ) G C and /?£ 
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( 6 ) 


Since (/3 U , oV) G C, then \Pi — Pj\ < off therefore ([ 6 ]) 
is indeed equivalent to (3]). Moreover, because C is a proper 
cone, we can rewrite d 6 ) as follows, 


min \\\P ' 

y(i,j)eE,{f 3^eR d ,o©eR} 2 


s.t. 


(3 lJ 

a© 


Ec 0 and f3 l p = 


ij _ J 

fe "I 


Pk 
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+ E 

k = i,j 

else 
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where (3 Ec 0 ©© P G C is defined as generalized in¬ 
equality ( [Boyd and Vandenberghe 2004| ). We call ([7]) the 
primal problem (which equals to the original TV prob¬ 
lem). Since the primal problem is both convex and satis¬ 
fies Slater’s condition, strong Lagrange duality holds (under 
generalized inequality). We define the Lagrange function as 
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c* 0 and C k J = 0 if k 7 ^ i, j, 


where (£ ZJ , r 1 ^) are the Lagrange multipliers and C* is the 
dual cone of C, defined as C* = {v | w T v > 0, Vw G C}. 
To formulate the dual problem, we take the derivative of L(-) 
with respect to the primal variables (/3, a) and we have 

(3 — z — ^ = 0 and 0^ — r^ = 0 . 


(9) 






































Figure 1: Graph structure and the flows. Each source-to-node edge 
(s,i) has 0 cost and a maximum capacity 7 \ and a minimum ca¬ 
pacity 7 i, this ensures a flow 7 i\ each node-to-node edge (i,j) has 
0 cost and a maximum capacity Oij ; each node-to-sink edge (i,t) 
has infinite capacity and a cost (in red) — (£** + 7 *)), where 
= JE ^ .^ eE Ci ■ All flows through node 2 is highlighted in 
blue for an example. 


By applying ([9]) to the dual problem is written 


as 


max.. ,,Ull z + E £ U ll2 + \h\\l 

v(i,i)e£,{^e R rf } 2 ( .^ 2 (10) 


S.t. (?3,T'3)eC*-, 7=0 


Proposition 4. G C*, Ck = 0, k 

er+ef = o, ieri < %. 

Please find the proof in the supplementary file. 
Accordingly, the dual problem becomes 


min - llz 

v(z,j)e£,{£^eR d } 2 

& - i 


E < 

(' i,ME 
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s.t. 7 = 0 , k± i,j, 7 + 7 = 0 , 171 < o ijt 


where we omit ||z||| from the objective since it is a constant 
with respect to £s and we also changed the sign of all £s for 
better illustration of the flows. 

Problem (JTTJ) can be viewed as the following minimum 
quadratic cost flow formulation, 


min 

v(i,j)eE,{e j eR d } 



( E ^+ 7 ) 11 ! 

d,j)eE 


s.t. 7 = 07 


+ 7 = o, 171 


< 0i 


IJ, 


( 12 ) 


where we have induced 7 ( 7 $ = max { \zi |, JE (i j)eE %}) 

and denote y=z + 7 to ensure y > 0 and (g ZJ + 7 ) > 0 . 
Thus each feasible £ of 0 is a possible flow on graph 
G=(V,E). Since Ci + Cf = 0. |£ u | can denote a flow 
on edge (i,j) such that 7 > 0 denotes a flow coming into 
node i and < 0 denotes a flow leaving node i. Figure [I] 
illustrates such flows by taking node 2 as an example. Thus 
to minimize the objective of ( fl2] ) is equivalent to comput¬ 
ing a minimum cost flow on this graph. Since the cost is 
quadratic with respect to the flow, this problem is_a mini¬ 
mum quadratic cost flow problem. According to ( [Hochbaum 


Table 1: Runtime (in sec.) comparison of the proposed algo¬ 
rithm with CVX. d is the data dimensionality. * indicates the 
test did not finish within 24 hours. 


d 

400 

900 

2500 

4900 

10000 

CVX 

5.22 

33.71 

889.80 

1.08e 4 

* 

Ours 

0.87 

2.39 

15.95 

64.33 

321.93 


and Hon g 1995| |Mairal et al. 201 1| ), this type of problems 
can be efficiently solved via fast flow al gorithms including 
but not limited to the parametr ic flow (Gallo, Gr igoriadis, 
and Tarja 1989[ ). Note that in ( [Xin et al. 2014| ), TV is shown 
equivalent to a minimum norm point (MNP) problem under 
submodular constraints which is solved via parametric flow. 
We now discuss the relation between the dual problem i.e. 
© or (TT| ) and the MNP considered in ( [Xin et al. 2014[ ). 

Recall that the MNP problem is defined as follows 


mm z — s 

seR d ,seB(\f c ) 


2 ? 


(13) 


where fc(S) is a cut function, defined as / C (<S) = 

JZies jev\s w ij anc * #(*) is the base polyhedron of f c . 

Proposition 5. For any minimizer of © define s such 
that s = j)eE > then s is a minimizer of (13] ). For 
any minimizer s* of there exists a decomposition such 
that s*=5^ j^ eE C J > where £ is one minimizer of ( pT| ). 

According to Prop. [5] the MNP problem i.e. © can 
be viewed as a special case of © (the conic dual), 
where j)eE C J=S - Moreover, since (JTTJ) has relatively 
“looser” constraints, it is possible to devise more efficient 
algorithms (than parametric flow) to solve fTT] ) and there¬ 
after TV. For example, in ( [Mairal et al. 201 lj ), a faster (than 
parametric flow) flow algorithm is proposed to solve their 
specific minimum quadratic flow problem. Hence, the conic 
dual perspective opens a new opportunity to solve the fa¬ 
mous TV problem more efficiently. 

Optimization summary. In summary, by applying Prop. 
[I] we can solve n 2 GFL by iteratively solving By apply¬ 
ing Prop.[2j we further reduce to the TV problem defined 
in ([3]), we then transform it to a minimum cost flow algo¬ 
rithm via conic duality and solve it by a fast flow algorithm. 

In Tab. [T] we compare the proposed algorithm with an 
off-the-shelf solver on synthetic data. We generate a random 
(3 G R d and a 2D grid graph of d nodes with each node 
having four neighbors. We then generate N = d/2 sam¬ 
ples: Xi G and yi = /3 T Xi + O.Oln^, where xi and rti 
are drawn from the standard normal distribution. All experi¬ 
ments are carried out on an Intel(R) Core(TM) i7-3770 CPU 
at 3.40GHz. The experiments show that the proposed opti¬ 
mization algorithm is more efficient and scalable. 


Application to the Diagnosis of AD 

In the diagnosis of AD, two fundamental issues are AD/NC 
(Normal Control) classification and MCI/NC (Mild Cog¬ 
nitive Impairment) classification. Let G R d be sub- 



































(a) T-test(LDA) (b) LapL (c) lasso (d) GFL (e) rrGFL 


Figure 2: Feature selection by different models. The top row illustrates selected voxels in a 3D model (voxels with positive are in brown 
and negative ones are in blue), the mid and bottom rows illustrate the corresponding projections on brain slices. 


jects’ sMRI voxels and yi = {—1,1} be the disease sta¬ 
tus (AD/NC or MCI/NC). Since the problems are classifica¬ 
tions, we use the logistic regression as the loss term 

N 

l{p) = ^2 lo S + eX P (~yi(P Tx i + c )))> (14) 

i= 1 

where c G Mis the bias parameter (to be learned). For the 
graph structure, we define each voxel as a node and their 
spatial adjacency as the edges, i.e. Wij =1 if voxels i and j 
are adjacent and 0 otherwise. The data are obtained from 
the Alzheimer’s Disease Neuroimaging Initiative (ADNI) 
databas^] We split all the baseline data into 1.5T and 3.0T 
MRI scans datasets (named 15T and 30T). 64 AD patients, 
90 NC and 208 MCI patients are included in our 15T dataset; 
66 AD patients and 110 NC are included in our 30T dataset. 
(Most 30T MCI data are in an on-going phase and are not 
included). D ata p reprocessing follows the DARTEL VBM 
pipeline (Ashbumer and others 2007| as commonly done in 
the literature. 2,527 8x8x8 mnU size voxels that have val¬ 
ues greater than 0.2 in the mean gray matter population tem¬ 
plate serve as the input features. We design experiments on 
three tasks, namely, 15ADNC, 30ADNC, 15MCINC. 

Classification Accuracy. 10-fold cross-validation (CV) 
evaluation is applied and the classification accuracy for all 
tasks are summarized in Tab. [2] Under exactly the same ex¬ 
periment setup, we compare n 2 GFL with the state-of-the- 
art classifiers: logistic regression (LR), SVM, sparse models 

2 http://adni.loni.ucla.edu 


Table 2: Classification accuracies. 



SVM 

LR 

MLDA 

LapL lasso 

GFL 

n 2 GFL 

15ADNC 

83.1 

83.1 

83.1 

84.4 

85.7 

85.1 

86.4 

30ADNC 

87.5 

86.9 

83.5 

87.5 

88.6 

85.8 

90.3 

15MCINC 

71.1 

70.5 

59.4 

70.1 

70.8 

69.8 

71.8 


e.g. the lasso and its graph Laplacian structured variants, i.e. 
the LapL, the unc onstrained GFL ( |Xin et al. 2014| ), and the 
“MLDA” model ( |Dai et al. 2012| ), which applies a variant of 
Fisher Discriminant Analysis after univariate feature selec¬ 
tion (via T-test). For each model, we used grid-search to find 
the optimal parameters respectively. Note that our accura¬ 
cies may not be superior to the recent work ( Ciu et al. 20l4| ), 
the main reason is that in ( |Liu et al. 2014| ), multi-modality 
data (including PET and sMRI data) are used. Nevertheless, 
Tab. [2] demonstrates that n 2 GFL outperforms all the other 
models using only voxel-based sMRI data. 


Feature selection. For each task, the selected features 
are those whose /3 are not zero . In Figure [2] the result of 
30ADNC is used to illustrate the feature selection by differ¬ 
ent models (using the parameters at their best accuracy). As 
shown, the selected voxels by both GFL and n 2 GFL cluster 
into several spatially connected regions, whereas those of 
lasso and T-test/MLDA scatter around. Also, as mentioned 
before, the LapL tends to select much more voxels than nec- 



















Figure 3: Stability of selected voxels across different folds of the cross-validation. The results of 5 different folds are shown in (a)-(e). The 
voxels with positive (3 are in brown, negative ones are in blue. The common/overlapped voxels selected in all 10 folds are shown in green (f). 
The top row illustrates voxels selected by the lasso model, the mid row illustrates those of GFL and the bottom row shows those of n 2 GFL. 


essary due to the 1 2 regularization. Moreover, the selected 
voxels by GFL and n 2 GFL are concentrated in Hippocam¬ 
pus, ParaHippocampal gyrus (which are believed to be the 
early damaged regions). On the other hand, the lasso and T- 
test/MLDA either select less lesion voxels or select probably 
noisy voxels not in the early damaged regions. 


Feature Stability. In Figure [3lwe show the selected vox¬ 
els across different folds of As shown, the selected 
voxels by lasso vary much across different folds, whereas 
the selected voxels by GFL are more stable. However, by as¬ 
suming the positive correlation between the features and the 
disease labels in n 2 GFL, we further increase the stability. 
To quantitatively evaluate the stability gain, we denote the 
variables of the kth fold of CV as (3(k). We introduce two 
measurements here. In ( |Yu 2013| ), the Estimation Stability 
(ES) is proposed to measure the stability of the estimation 


Table 3: Stability comparison of the models. 



lasso 

GFL 

n 2 GFL 

ES (smaller is better) 

0.035 

0.033 

0.022 

mDC (larger is better) 

0.267 

0.374 

0.644 


We denote set S(k) = {i : /%(&) ^ 0} and define mDC as 

K 

mDC = K#(n^ 1 S(k))/Y / #(S(k)), (16) 

k=l 

where # is the number of elements in a set. In Tab. [3j both 
measurements quantitatively suggest n 2 GFL obtains much 
more stable voxels due to the consideration of the correlation 
between the features and the disease labels 0 


K 

es = ^2 IIW) - mi/mwl (15) 

k=1 


where /3 = J2k=tP(k)/K. It is shown in (Yu 2013) that ES 
is a fair measurement of the estimation stability. To further 
understand the stability of feature selection, we also extend 
the Dice coefficient ( [Dice 1945) to multiple sets and apply 
the multi-set Dice Coefficient (mDC) as a measurement. 


3 Here, parameters were determined by accuracy. Similar results 
were observed using parameters producing same level of sparsity. 


Conclusions 


In this paper, we explore the nonnegative generalized fused 
lasso model to address an important problem of neuroim¬ 
age analysis, i.e. the stability of feature selection. Experi¬ 
ments show that our model greatly improves the stabilities 


4 We notice that, in (Xin et al. 2014), the stability is computed 
using the top 50 positive voxels because these voxels are believe 
to be the most atrophied ones. By computing the stability of all 
non-zero voxels, the mDC of GFL drops around 30%. This clearly 
shows that the instability is caused largely by the undesirable vox¬ 
els that disagree with the correlation prior (those scattered blue 
voxels in the mid row). 

















of feature selection over existing methods for brain image 
analysis. Although n 2 GFL is applied to the diagnosis of AD 
problem, it can be applied to solve more general problems. 
Moreover, we believe that the theoretical points made here 
e.g. nonnegative FISTA, soft-thresholding and the conic dual 
of TV, provide motivation for future work of general interest. 
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